Implement ABGLSV-Pornin multiplication #323
This is really cool! A few comments / questions based on a quick read of the source code:
I did consider this API, but wasn't sure whether there were any cases where we would want to not use cofactor multiplication on the result for
Yep, this is also what I'd like. I kept the prior API initially so I had something to benchmark against 😄
It looks like in Curve9767 @pornin uses width-4 for runtime-calculated tables, and width-5 for pre-computed tables. IDK if he has relevant benchmarks, but it's another datapoint towards dropping width-8. I think it would make sense to examine this in a subsequent PR, separate to this change.
The input preparation is performance-critical, in that the pre-Pornin algorithms were slow enough that the reduction in doublings could not offset it (which led to the Ed25519 paper dismissing ABGLSV and using double-base scalar mult instead). That said, a I had originally started writing
Yes, this is what I mean. We want to leverage the fact that bit lengths are strictly-decreasing, to avoid operating on higher limbs that are guaranteed to not contain the MSB.
9b8b93d to b401033
I've reworked the PR following @hdevalence's comments, and added an AVX2 backend.
About window sizes: there are several parameters in play, not all of which apply to the present case; notably, my default implementations strive to work on very small systems, and that means using very little RAM. For Curve9767, each point in a window uses 80 bytes (affine coordinates, each coordinate is a polynomial of degree less than 19, coefficients on 16 bits each, two dummy slots for alignment); if the windows collectively contain 16 points (for instance), then that's 1280 bytes of stack space, and for the very low end of microcontrollers, that's too much (I must leave a few hundred bytes for the temporaries used in field element operations, and the calling application may also have needs). ROM/Flash size is also a constraint (though usually less severe), again encouraging using relatively small windows.

With a window of n bits, 2^(n-1) points must be stored (e.g. for a 4-bit window, this stores points P, 2P, ..., 8P, from which we can also dynamically obtain -P, -2P, ..., -8P). If using wNAF, we only need the odd multiples of these points (i.e. P, 3P, 5P and 7P for a 4-bit window), lowering the storage cost to 2^(n-2) points. In the signature verification, I have two dynamic windows to store: computing uA+vB+wC, with B being the generator but A and C dynamically obtained, I need one window for A and another for C. Therefore, if I want to use only 8 points (640 stack bytes), then I must stick to 4-bit windows. Static windows are in ROM, and there's more space there, but there's exponential growth; each 5-bit window is 1280 bytes, and there are two of them, so 2560 bytes of ROM for these.

In the x86/AVX2 implementation, for signature verification, I use 5-bit dynamic windows, and 7-bit static windows; for generic point multiplication (non-NAF, thus with also the even multiples), I have both static and dynamic 5-bit windows (four static windows for the base point).
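The storage arithmetic in this comment can be sketched in a few lines of Python. This is only an illustration: the 80-byte point size is the Curve9767 figure quoted above, and the helper names `stored_points` and `table_bytes` are hypothetical, not part of any library.

```python
# Size of one window entry for Curve9767, as quoted in the comment (assumption
# for this sketch; Ed25519 points have a different size).
POINT_BYTES = 80


def stored_points(window_bits: int, wnaf: bool) -> int:
    """Points kept for one n-bit window.

    Plain signed windows keep P, 2P, ..., 2^(n-1) P; wNAF keeps only the
    odd multiples, which halves the count to 2^(n-2).
    """
    return 1 << (window_bits - 2 if wnaf else window_bits - 1)


def table_bytes(window_bits: int, wnaf: bool, windows: int = 1) -> int:
    """Total bytes for `windows` tables of the given width."""
    return windows * stored_points(window_bits, wnaf) * POINT_BYTES


print(stored_points(4, wnaf=False))          # P, 2P, ..., 8P
print(stored_points(4, wnaf=True))           # P, 3P, 5P, 7P
print(table_bytes(4, wnaf=True, windows=2))  # two dynamic windows (A and C)
```

With these numbers, two dynamic 4-bit wNAF windows hold 8 points in total, which is the 640-byte stack budget mentioned above.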
The static windows add up to 10240 bytes, which I think is a reasonable figure for big x86, since there will typically be about 32 kB of L1 cache: again, we must think that the caller also has data in cache, and if we use up all the L1 cache for the signature verification, this may look good on benchmarks, but in practical situations this will induce cache misses down the line. We should therefore strive to use only a minority of the total L1 cache. Note that Ed25519 points are somewhat larger than Curve9767 points:

About CPU cost: this is a matter of trade-offs. In wNAF, with n-bit windows, building the window will require 2^(n-2)-1 point additions, and will induce on average one point addition every n+1 bits. With 127-bit multipliers, this means that 4-bit windows need 28.4 point additions on average (for each window, not counting the 126 doublings), while 5-bit windows need about 28.2. With Curve9767, the latter is better (if you have the RAM) for another reason which is not applicable to Ed25519: long sequences of point doublings are slightly more efficient, and longer windows increase the average length of runs of doublings. Thus, for dynamic windows and Ed25519, I'd say that 4-bit and 5-bit wNAF windows should be about equivalent (5-bit windows would be better if using 252-bit multipliers).

With static windows, there is no CPU cost in building windows, and larger windows are better, but there are diminishing returns. Going from 7-bit to 8-bit windows would save less than two point additions, possibly not worth the effort unless you are aiming at breaking the record in a microbenchmark context which will be meaningless in real situations.
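The cost figures in this comment follow from a simple model: building an n-bit wNAF table costs 2^(n-2)-1 additions, and the main loop then adds about once every n+1 bits. A minimal sketch of that model, assuming nothing beyond the formula stated above (the function name `wnaf_additions` is hypothetical):

```python
def wnaf_additions(window_bits: int, scalar_bits: int, static: bool = False) -> float:
    """Average point additions for one wNAF window (doublings not counted).

    A static (precomputed) table has no build cost; a dynamic one needs
    2^(n-2) - 1 additions to compute the odd multiples.
    """
    build = 0 if static else (1 << (window_bits - 2)) - 1
    return build + scalar_bits / (window_bits + 1)


# Dynamic windows with 127-bit multipliers: 4-bit and 5-bit are nearly tied.
print(round(wnaf_additions(4, 127), 1))
print(round(wnaf_additions(5, 127), 1))

# With 252-bit multipliers the 5-bit window pulls ahead.
print(wnaf_additions(4, 252), wnaf_additions(5, 252))

# Static windows: going from 7 to 8 bits saves fewer than two additions.
print(wnaf_additions(7, 127, static=True) - wnaf_additions(8, 127, static=True))
```

The model ignores the doubling-run effect described for Curve9767, so it only captures the addition count.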
Force-pushed to fix the serial and vector Straus impls, which were not correctly checking for the first non-zero
2194715 to 93fd6ea
This PR was previously based on release 2.0.0. I've rebased it onto Assuming this PR is merged, I plan to make a separate PR to
8b6c6c1 to e267f48
Fixed the simple bugs, but CI is still failing because between 2.0.0 and
81534c1 to f6ec510
src/backend/vector/avx2/edwards.rs
Outdated
])
.unwrap()),
);
println!("b_shl_128_odd_lookup_table = {:?}", b_shl_128_odd_table);
I added this test to match the one for `BASEPOINT_ODD_LOOKUP_TABLE`, and used it to regenerate the AVX2 `B_SHL_128_ODD_LOOKUP_TABLE` table (the contents of which apparently get generated differently after over 2 years of crate development, but both the old and new lookup tables pass tests).
src/backend/vector/ifma/edwards.rs
Outdated
let basepoint_odd_table =
    NafLookupTable8::<CachedPoint>::from(&constants::ED25519_BASEPOINT_POINT);
println!("basepoint_odd_lookup_table = {:?}", basepoint_odd_table);
I copied this from the AVX2 tests to have an equivalent check of the AVX512IFMA lookup table, but I don't have a suitable device to test this.
src/backend/vector/ifma/edwards.rs
Outdated
])
.unwrap()),
);
println!("b_shl_128_odd_lookup_table = {:?}", b_shl_128_odd_table);
I copied this from the AVX2 tests to have an equivalent check of the AVX512IFMA lookup table. Someone with a suitable device needs to run this test and extract the output of this `println` so we can update `ifma::constants` with the correct table.
src/backend/vector/ifma/constants.rs
Outdated
@@ -2060,3 +2060,2031 @@ pub(crate) static BASEPOINT_ODD_LOOKUP_TABLE: NafLookupTable8<CachedPoint> = Naf
),
])),
]);

/// Odd multiples of `[2^128]B`.
// TODO: generate real constants using test in `super::edwards`.
This is currently just a duplicate of `BASEPOINT_ODD_LOOKUP_TABLE` to get the build CI checks to pass.
Moving the todo list out of the top post:
The last two items are not blockers for this PR.
d785f10 to d448edd
Uses Algorithm 4 from Pornin 2020 to find a suitable short vector.

References:
- Pornin 2020: https://eprint.iacr.org/2020/454
d448edd to b13b3a6
Force-pushed to fix post-rebase bugs and get CI passing.
a3524dc to 20e355e
Force-pushed to add changelog entries and fix documentation.
/// Checks whether \\([8a]A + [8b]B = [8]C\\) in variable time.
///
/// This can be used to implement [RFC 8032]-compatible Ed25519 signature validation.
/// Note that it includes a multiplication by the cofactor.
///
/// [RFC 8032]: https://tools.ietf.org/html/rfc8032
pub fn vartime_check_double_scalar_mul_basepoint(
`ed25519-dalek` is now in the same workspace as `curve25519-dalek`, so I can make changes to it in this PR, but I think the next question is how we use this method.
I opened this PR in May 2020. Originally I just returned the scalar mul output directly, but @hdevalence suggested this "check" API instead, where the `EdwardsPoint` version would multiply by the cofactor. I migrated to that, noting that we might want to make the cofactor multiplication configurable.
In October 2020 @hdevalence published his survey of Ed25519 validation criteria. Some time in the intervening 3.5 years, `ed25519-dalek` has gained several separate signature verification methods that all use this helper function internally:
curve25519-dalek/ed25519-dalek/src/verifying.rs
Lines 216 to 224 in cc3421a
// Helper function for verification. Computes the _expected_ R component of the signature. The
// caller compares this to the real R component. If `context.is_some()`, this does the
// prehashed variant of the computation using its contents.
// Note that this returns the compressed form of R and the caller does a byte comparison. This
// means that all our verification functions do not accept non-canonically encoded R values.
// See the validation criteria blog post for more details:
// https://hdevalence.ca/blog/2020-10-04-its-25519am
#[allow(non_snake_case)]
fn recompute_R<CtxDigest>(
These helpers are therefore either checking "ad-hoc" or "strict" equality of R, neither of which multiplies by the cofactor. Meanwhile the `ed25519-zebra` crate implements the ZIP 215 signature validation rules, which are the "expansive" rules (R is not required to be a canonical encoding, and multiplication by cofactor is required).
So I think we do want some kind of configurability here over the cofactor multiplication. What should this look like? A boolean argument, or two separate APIs?
Note also that the scalar mul optimization implemented in this PR actually checks `[δa]A + [δb]B = [δ]C`, where `δ` is a value invertible mod ℓ.
If there is on the curve a non-trivial point `T` of order `h`, then replacing `R` with `R+T` will make the standard verification equation fail, but the second one will still accept the signature if it so happens that the value `δ` (obtained from the lattice basis reduction algorithm) turns out to be a multiple of `h`.
Is there a way we can avoid this by adjusting the lattice basis reduction algorithm to filter out these `δ` values? If not, then we cannot use this optimisation for the "strict" verification methods, and it is debatable whether we should even use it for the "ad-hoc" methods (as doing so would change the ill-defined set of valid signatures; not that there aren't already wide inconsistencies between implementations here, but this would be a difference between two versions of `curve25519-dalek`, and IDK what the maintainers' policy here is).
Regardless, we definitely should offer a "mul-by-cofactor" version of this in the API, as `ed25519-zebra` (and anyone else using the cofactor check equation) will benefit from it (as will anyone using `RistrettoPoint` in a signature scheme, which fortunately does not suffer from this problem).
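The concern about `δ` being a multiple of the order of a torsion point can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not curve arithmetic: it represents "points" by their discrete logarithms in the cyclic group Z/(8ℓ) with a tiny prime ℓ = 13, so that scalar multiplication becomes integer multiplication mod 8ℓ. The function name `check` is hypothetical.

```python
ELL = 13          # toy prime standing in for the subgroup order (assumption)
N = 8 * ELL       # toy full group order, cofactor 8


def check(delta: int, a: int, A: int, b: int, B: int, C: int) -> bool:
    """Toy version of the check delta*(a*A + b*B - C) == 0 in Z/(8*ELL)."""
    return (delta * (a * A + b * B - C)) % N == 0


B = 8                      # element of order ELL, standing in for the basepoint
A = 24                     # another element of the prime-order subgroup
a, b = 5, 7
C = (a * A + b * B) % N    # honest right-hand side

T2 = N // 2                # torsion element of order 2
C_bad = (C + T2) % N       # R replaced by R + T

print(check(1, a, A, b, B, C_bad))   # plain equation rejects
print(check(3, a, A, b, B, C_bad))   # odd delta still rejects
print(check(2, a, A, b, B, C_bad))   # even delta kills the order-2 component
```

In this model an even `δ` accepts the tampered value, while any odd `δ` that is not a multiple of ℓ behaves exactly like the plain equation.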
I have looked a bit at the problem, I think leveraging the optimization while being strictly equivalent to the cofactorless equation is doable, but it is a bit unpleasant.
We have a public key `A`, generator is `B`, signature is `(R, s)`, and during the verification, the challenge `k` is computed as a SHA-512 output, which is then interpreted as an integer. The curve has order `8*L`. Points `A` and `R` are on the curve, but not necessarily in the subgroup of order `L`. The cofactorless verification equation is:
    s*B - k*A = R
First, we should note that while `k` is nominally a 512-bit integer, the implementation in curve25519-dalek represents `k` as a `Scalar`, which implies reduction modulo `L`. This already deviates from the cofactorless equation in RFC 8032, where there is no such reduction. This matters if `A` is not in the subgroup of order `L`; for instance, it may happen that `k` is, as an integer, a multiple of 8, while `k mod L` is an odd integer, in which case the cofactorless equation would report a success, while the dalek implementation would reject it. The reverse is also possible (signature accepted by dalek but rejected by the RFC). All these variants are still within the scope of the signature algorithm, i.e. the discrepancies between verifier behaviours do not allow actual signature forgeries by attackers not knowing the private key. There is some extra discussion in the Taming the many EdDSAs paper (page 11). Here I am discussing reproducing the exact behaviour of the current dalek implementation, and therefore I call `k` the reduction of the SHA-512 output modulo `L`.
Given `k`, one can compute `k8 = k mod 8` (the three low bits of `k`). The cofactorless equation is then equivalent to:
    s*B - ((k - k8)/8)*(8*A) - (R + k8*A) = 0
Thus, by replacing `k`, `A` and `R` with, respectively, `(k >> 3)`, `8*A` and `R + (k & 7)*A`, I have a completely equivalent equation (thus with the same behaviour), but I have also guaranteed that the `A` point is in the proper subgroup of order `L`. Thus we can now assume that `A` is in that subgroup. This is important: when multiplying `A` by an integer `x`, we can now reduce `x` modulo `L` without any loss of information.
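The substitution can be sanity-checked numerically. The sketch below uses the same kind of stand-in as a discrete-log model: "points" are integers mod `8*L`, which is an assumption made purely for illustration, and `split_inputs` and `residual` are hypothetical helper names.

```python
import random

L = 2**252 + 27742317777372353535851937790883648493  # Ed25519 subgroup order
N = 8 * L                                             # full curve order
B = 8   # stand-in for the basepoint: an element of order L in Z/(8L)


def residual(s: int, k: int, A: int, R: int) -> int:
    """Left-hand side s*B - k*A - R of the cofactorless equation, mod 8L."""
    return (s * B - k * A - R) % N


def split_inputs(k: int, A: int, R: int):
    """Replace (k, A, R) with (k >> 3, 8*A, R + (k & 7)*A)."""
    return k >> 3, (8 * A) % N, (R + (k & 7) * A) % N


random.seed(2020)
s, k = random.randrange(L), random.randrange(L)
A, R = random.randrange(N), random.randrange(N)
k2, A2, R2 = split_inputs(k, A, R)

print(residual(s, k, A, R) == residual(s, k2, A2, R2))  # same equation
print((L * A2) % N == 0)                                # new A has order dividing L
```

The two residuals agree for any inputs because `8*(k >> 3) + (k & 7) == k`, so nothing about the verification outcome changes.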
When we apply Lagrange's algorithm on the lattice basis `((k, 1), (L, 0))`, we get a new basis `((u0, u1), (v0, v1))` for the same lattice. In algorithm 4 in my paper, we stop as soon as the smaller of these two vectors is "small enough", but we can also reuse the stopping condition from algorithm 3, i.e. we can change this:
    if len(N_v) <= t:
        return (v, u)
into:
    if len(N_v) <= t:
        if 2*abs(p) <= N_v:
            return (v, u)
This would, on average, add maybe one or two iterations to the algorithm, i.e. the extra cost on the algorithm would likely be negligible. By using this test, we ensure that not only `v` is truly the smallest non-zero vector in the lattice, but `u` is the second smallest non-zero vector among those which are not colinear to `v` (this kind of assertion breaks down at higher lattice dimensions, but in dimension 2 it works).
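For readers who want to experiment, here is a minimal sketch of textbook Lagrange (Gauss) reduction in dimension 2 using Python big integers. It is not the binary, division-free Algorithm 4 from the paper and makes no attempt to be fast; it only produces a basis satisfying the size-reduction test `2*abs(p) <= N_v` discussed above. The function name `lagrange_reduce` is hypothetical.

```python
L = 2**252 + 27742317777372353535851937790883648493  # Ed25519 subgroup order


def norm2(w):
    """Squared Euclidean norm of a 2D integer vector."""
    return w[0] * w[0] + w[1] * w[1]


def lagrange_reduce(k: int, modulus: int):
    """Reduce the basis ((k, 1), (modulus, 0)); return (v, u) with v shortest."""
    u, v = (modulus, 0), (k, 1)
    if norm2(u) < norm2(v):
        u, v = v, u
    while True:
        p = u[0] * v[0] + u[1] * v[1]      # inner product <u, v>
        n_v = norm2(v)
        q = (2 * p + n_v) // (2 * n_v)     # nearest integer to p / n_v
        u = (u[0] - q * v[0], u[1] - q * v[1])
        if norm2(u) >= n_v:
            # Now 2*|<u, v>| <= N_v and |v| <= |u|: the basis is reduced.
            return v, u
        u, v = v, u


print(lagrange_reduce(5, 13))  # ((-2, -3), (3, -2))
```

Every vector `(w0, w1)` in the output satisfies `w0 = w1*k mod modulus`, which is the property the verification equation relies on.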
Now, Lagrange's algorithm starts here with `u = (k, 1)`, and 1 is odd. Moreover, each step either adds a multiple of `v` to `u`, or a multiple of `u` to `v`. The consequence is that `u1` and `v1` can never both be even; at least one of them is odd. The important point here is that if `v1` is an odd integer, and less than `L` (by construction), then it is invertible modulo `L` (since `L` is prime) but also modulo 8 (since it is odd). Thus, `v1` is invertible modulo `8*L`. If `v1` is invertible modulo `8*L`, which is the whole curve order, then we can multiply the verification equation by `v1` in a reversible way, i.e. without changing the behaviour. We thus get:
    (v1*s mod L)*B - v0*A - v1*R = 0
which is the Antipa et al optimization. Note that the equivalence relies on two properties: that `A` is in the right subgroup (so that we can replace `k*v1` with `v0`), and that `v1` is odd.
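The invertibility condition is easy to test directly. In this small sketch, `is_reversible_multiplier` is a hypothetical helper name; it relies only on `math.gcd` and on the three-argument `pow` with exponent `-1`, which computes a modular inverse in Python 3.8 and later.

```python
from math import gcd

L = 2**252 + 27742317777372353535851937790883648493  # Ed25519 subgroup order
CURVE_ORDER = 8 * L


def is_reversible_multiplier(v1: int) -> bool:
    """True when multiplying by v1 is a bijection on a group of order 8*L."""
    return gcd(v1, CURVE_ORDER) == 1


odd_v1 = -12345                      # odd and far smaller than L in magnitude
inv = pow(odd_v1, -1, CURVE_ORDER)   # exists because gcd(v1, 8L) == 1
print((odd_v1 * inv) % CURVE_ORDER)  # 1
print(is_reversible_multiplier(-2))  # even values are not invertible mod 8
```

An even `v1` fails the test, which is exactly the case where the thread proposes falling back to `(u0, u1)`.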
The unpleasantness is that `v1` might be even. As explained above, if `v1` is even, then `u1` must be odd, hence we can use `(u0, u1)` instead of `(v0, v1)`. However, the smallest non-zero vector in the lattice is `v`, not `u`. Heuristically, `u` is not much bigger than `v`, but there are some degenerate cases. For instance, if `k = (L - 1)/2`, then the output of Lagrange's algorithm is `v = (1, -2)` (very small, but -2 is even), and `u = (2*(L+1)/5, (L-4)/5)` (denominator `u1 = (L-4)/5` is odd, but both `u0` and `u1` are almost as large as `L`).
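The degenerate example can be checked with plain integer arithmetic. The sketch below only verifies the stated vectors against the lattice definition (`w0 = w1*k mod L`); it does not run any reduction algorithm, and `in_lattice` is a hypothetical helper name.

```python
L = 2**252 + 27742317777372353535851937790883648493  # Ed25519 subgroup order

k = (L - 1) // 2
v = (1, -2)
u = (2 * (L + 1) // 5, (L - 4) // 5)


def in_lattice(w) -> bool:
    """Membership in the lattice spanned by (k, 1) and (L, 0)."""
    return (w[0] - w[1] * k) % L == 0


dot = u[0] * v[0] + u[1] * v[1]
norm2_v = v[0] ** 2 + v[1] ** 2

print(in_lattice(v), in_lattice(u))        # both vectors lie in the lattice
print(v[0] * u[1] - v[1] * u[0] == L)      # they form a basis (determinant L)
print(2 * abs(dot) <= norm2_v)             # and the basis is size-reduced
print(v[1] % 2, u[1] % 2)                  # v1 is even, u1 is odd
print(u[0].bit_length(), u[1].bit_length())
```

The divisions by 5 are exact because `L` is congruent to 4 modulo 5, and the bit lengths confirm that both coordinates of `u` are close to the size of `L`.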
In the verification algorithm, `k` is an output of SHA-512, and thus attackers would have trouble crafting signatures that leverage the most degenerate cases, and we can heuristically consider that `u` won't be a very large vector, but the lattice reduction algorithm must still perform updates on `u` and `v` with their full 254-bit size (including the sign bit); the nice trick of computing them only over 128 bits is no longer applicable. This may conceivably increase the cost, and thus decrease the usefulness of the optimization.
Summary: the behaviour of the current implementation (with the cofactorless equation) can be maintained while applying the Antipa et al optimization, provided that the following process is applied:
- Compute `k` as previously, with a SHA-512 output and with reduction modulo `L` (to maintain backward compatibility).
- Replace `k`, `A` and `R` with `k >> 3`, `8*A` and `R + (k & 7)*A`, respectively.
- Compute Lagrange's algorithm over `((k, 1), (L, 0))` (optionally with the extra ending test so that a truly size-reduced basis is obtained, to make both basis vectors as small as possible). Updates to coordinates of `u` and `v` must be maintained over their full size (254 bits).
- Given the output `((v0, v1), (u0, u1))` of Lagrange's algorithm (with `(v0, v1)` being the smallest non-zero vector in the lattice), use `(v0, v1)` if `v1` is odd; but if `v1` turns out to be even, use `(u0, u1)` instead (in that case, `u1` is odd). Since `u` is not the smallest vector, its coordinates can be larger than `sqrt(1.16*L)`, so the combined Straus algorithm must be able to handle large coefficients (even if these are improbable in practice).
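The four steps above can be exercised end to end in a toy setting. The sketch below is an illustration under loud assumptions: "points" are integers in Z/(8ℓ) with a small prime ℓ = 1009 standing in for `L`, the hash step is skipped (`k` is passed in already reduced), and the lattice reduction is the textbook Lagrange algorithm rather than the paper's Algorithm 4. All function names are hypothetical.

```python
import random

ELL = 1009     # toy prime standing in for L (assumption)
N = 8 * ELL    # toy curve order; "points" are residues mod N
B = 8          # element of order ELL, standing in for the basepoint


def lagrange_reduce(k, ell):
    """Size-reduced basis (v, u) of the lattice spanned by (k, 1), (ell, 0)."""
    norm2 = lambda w: w[0] * w[0] + w[1] * w[1]
    u, v = (ell, 0), (k, 1)
    while True:
        p, n_v = u[0] * v[0] + u[1] * v[1], norm2(v)
        q = (2 * p + n_v) // (2 * n_v)     # nearest integer to p / n_v
        u = (u[0] - q * v[0], u[1] - q * v[1])
        if norm2(u) >= n_v:
            return v, u
        u, v = v, u


def check_cofactorless(s, k, A, R):
    """Reference behaviour: s*B - k*A == R in the full group."""
    return (s * B - k * A - R) % N == 0


def check_optimized(s, k, A, R):
    # Step 2: move the low three bits of k into R so that A lands in the subgroup.
    k2, A2, R2 = k >> 3, (8 * A) % N, (R + (k & 7) * A) % N
    # Step 3: reduce the lattice for the new k.
    v, u = lagrange_reduce(k2, ELL)
    # Step 4: pick the vector whose second coordinate is odd.
    w0, w1 = v if v[1] % 2 == 1 else u
    return (((w1 * s) % ELL) * B - w0 * A2 - w1 * R2) % N == 0


random.seed(1)
s, k, A = random.randrange(ELL), random.randrange(ELL), random.randrange(N)
R = (s * B - k * A) % N
print(check_cofactorless(s, k, A, R), check_optimized(s, k, A, R))
R_torsion = (R + N // 2) % N   # add an order-2 component
print(check_cofactorless(s, k, A, R_torsion), check_optimized(s, k, A, R_torsion))
```

In this model the two checks agree on honest inputs and on inputs with a torsion component added, which is the equivalence the summary claims; it says nothing about performance or about the real field and point arithmetic.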
WARNING: I wrote all this without actually implementing it. It seems to make sense on paper, but until it is implemented and tested, there's no guarantee I did not make a mistake.
Thanks @pornin for looking into this! I think the proposed changes are complex enough that they should be made and tested in a separate PR.
To avoid blocking this PR further, I propose that we rename `EdwardsPoint::vartime_check_double_scalar_mul_basepoint` to something like `EdwardsPoint::vartime_check_double_scalar_mul_basepoint_cofactor`, and then in a subsequent PR we can attempt to expose an `EdwardsPoint::vartime_check_double_scalar_mul_basepoint` that is cofactor-less.
20e355e to fd8952c
Force-pushed to move the new generated serial tables into separate submodules, and added cfg-flagged tests to generate them, and a CI job that verifies them. If this works, I'll attempt to replicate this for the vector tables.
0d8eea8 to a66efb2
This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses windowed non-adjacent form Straus for the multiscalar multiplication.

References:
- Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf
Checks whether [8a]A + [8b]B = [8]C in variable time. This can be used to implement RFC 8032-compatible Ed25519 signature validation. Note that it includes a multiplication by the cofactor.
Checks whether [a]A + [b]B = C in variable time.
a66efb2 to 5e03d5c
Force-pushed to fix the Fiat backends, and adjust the new CI check to fail if the table generators do nothing (as they generate output that is incorrectly formatted, and thus detectable).
5e03d5c to c96c810
Force-pushed to implement a similar kind of generator approach for the AVX2 vector table. It doesn't currently work because the
c96c810 to a77e13b
Force-pushed to fix the AVX2 table generator. The generated constant is concretely different from before (I presume something changed about the wNAF implementation in the intervening four years), but tests pass before and after the change (and I checked that mutating either version of the constant causes a test to fail).
a77e13b to 01a9e9e
Force-pushed to implement a generator for the IFMA vector table, based on the working AVX2 generator. It should work, but I don't have the hardware to run it, and so the IFMA constants remain invalid. Someone with compatible hardware needs to run the following commands on this branch:
and then provide the resulting diff to the IFMA table.
Adds a backend for computing `δ(aA + bB - C)` in variable time, where:
- `B` is the Ed25519 basepoint;
- `δ` is a value invertible mod ℓ, which is selected internally to the function.

This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses Algorithm 4 from Pornin 2020 to find a suitable short vector, and then windowed non-adjacent form Straus for the resulting multiscalar multiplication.
References: